Replace Python ANTLR parser with C++ parser via pybind11 #588
Draft
javihern98 wants to merge 15 commits into main from
Replace the antlr4-python3-runtime dependency with a C++ ANTLR parser exposed through pybind11, cutting parse time by 95.7% (9.6s → 0.41s for 8000 statements).

Key changes:
- Add C++ ANTLR parser with pybind11 lazy-wrapping bindings
- Refactor ASTConstructor to use (rule_index, alt_index) dispatch
- Switch build backend from poetry-core to scikit-build-core
- Update CI workflows for C++ compilation and cibuildwheel
- Remove antlr4-python3-runtime dependency
- Delete dead files: lexer.py, parser.py, VtlVisitor.py
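As a rough illustration of the (rule_index, alt_index) dispatch mentioned in this commit, the sketch below looks up a handler by a node's `ctx_id` instead of testing `isinstance` against generated context classes. The attribute names `ctx_id`, `is_terminal`, `text`, and `children` follow the PR description; the `Node` class, the constant values, the handler names, and the use of a dictionary are all hypothetical and are not taken from the actual ASTConstructor.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical constants standing in for entries of _rule_constants.py.
# Each one is a (rule_index, alt_index) pair; the numbers are made up.
EXPR_ARITHMETIC: Tuple[int, int] = (1, 0)
EXPR_CONSTANT: Tuple[int, int] = (1, 1)


@dataclass
class Node:
    """Stand-in for a parse node from the C++ bindings (assumed shape)."""

    ctx_id: Tuple[int, int] = (-1, -1)
    text: str = ""
    is_terminal: bool = False
    children: List["Node"] = field(default_factory=list)


def visit_constant(node: Node) -> int:
    # A constant rule wraps a single terminal token holding the literal.
    return int(node.children[0].text)


def visit_arithmetic(node: Node) -> int:
    # Operands are rule nodes; the operator is the only terminal child.
    operands = [child for child in node.children if not child.is_terminal]
    operator = next(child for child in node.children if child.is_terminal)
    left, right = visit(operands[0]), visit(operands[1])
    return left + right if operator.text == "+" else left - right


# One dictionary lookup replaces a chain of isinstance() checks.
HANDLERS: Dict[Tuple[int, int], Callable[[Node], int]] = {
    EXPR_ARITHMETIC: visit_arithmetic,
    EXPR_CONSTANT: visit_constant,
}


def visit(node: Node) -> int:
    handler = HANDLERS.get(node.ctx_id)
    if handler is None:
        raise ValueError(f"no handler for ctx_id {node.ctx_id}")
    return handler(node)
```

The real constructor may compare `ctx_id` against constants directly rather than going through a table; the point here is only that dispatch needs an integer pair, not the generated Python context classes.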
Fix ruff I001 import sorting errors across 7 files after moving _cpp_parser into Grammar/. Update cibuildwheel config: test-requires as array, per-platform before-build commands.
poetry install doesn't invoke scikit-build-core to compile the C++ extension. Use pip install . (with build isolation) after installing deps to actually build the .so module. Update version.yml to not require poetry for version extraction.
Testing workflow: copy the compiled C++ extension from site-packages back to the source tree so mypy can resolve the import. Ubuntu 24.04: use --no-deps to avoid upgrading system numpy/pandas which causes binary incompatibility errors.
The YAML folded scalar (>) was preserving leading whitespace in the python -c command, causing IndentationError. Use single-line command.
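The failure mode described here can be reproduced without YAML: once a line of the inline script reaches the interpreter with unexpected leading whitespace, Python rejects it. The sketch below is only an illustration. The two command strings are hypothetical and are not the ones used in the workflow, and `check_inline_command` is a helper invented for this example.

```python
def check_inline_command(command: str) -> str:
    """Compile a script the way `python -c` would and report the outcome."""
    try:
        compile(command, "<python -c>", "exec")
    except IndentationError as exc:
        return f"IndentationError: {exc.msg}"
    return "ok"


# Hypothetical script text as it might arrive after YAML folding kept
# the leading spaces of a more-indented line.
multi_line = "import sys\n    print(sys.version_info.major)\n"

# The same logic written as a single line has no indentation to preserve.
single_line = "import sys; print(sys.version_info.major)"

print(check_inline_command(multi_line))
print(check_inline_command(single_line))
```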
Use follow_imports = "silent" for vtlengine.AST.* in mypy config to suppress errors from AST files when the C++ extension .so isn't in the source tree (CI builds install to site-packages only). Remove the copy C++ extension step from the testing workflow.
Instead of silencing all AST modules, target only:
- _cpp_parser: follow_imports=silent (handles missing .so in CI)
- ASTConstructor + ASTConstructorModules: disallow_untyped_calls=false
No need to build the C++ parser just to check version consistency. Extract __version__ with grep and pyproject version with tomllib, removing the ANTLR download and pip install steps entirely.
Pure bash version check using grep — no Python, no build needed.
Use actions/cache to store the built wheel keyed on OS, Python version, and hash of C++ source files. On cache hit, ANTLR download and C++ compilation are skipped entirely — only the wheel install runs.
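The idea behind such a key can be sketched in Python: the key changes only when the OS, the Python version, or the content of a C++ source file changes, so an unchanged tree maps to the same cached wheel. In the workflow itself the key would be built from GitHub Actions expressions rather than Python; the function names, the key format, and the file names below are hypothetical.

```python
import hashlib
from typing import Mapping


def source_hash(sources: Mapping[str, bytes]) -> str:
    """Hash file paths and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for path in sorted(sources):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur
        digest.update(sources[path])
    return digest.hexdigest()


def wheel_cache_key(
    os_name: str, python_version: str, sources: Mapping[str, bytes]
) -> str:
    """Combine OS, Python version, and source hash into one cache key."""
    return f"wheel-{os_name}-py{python_version}-{source_hash(sources)[:16]}"


# Hypothetical inputs: file contents would normally be read from disk.
sources = {"bindings.cpp": b"// bindings", "CMakeLists.txt": b"# cmake"}
print(wheel_cache_key("Linux", "3.12", sources))
```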
The missing .so cascades errors through all AST files, not just the constructor. Use follow_imports=silent for vtlengine.AST.* which matches the existing exclude pattern's intent.
- Define ANTLR4CPP_STATIC to avoid dllimport errors on Windows
- Use /w instead of -w for MSVC warning suppression
- Broaden mypy follow_imports=silent to all vtlengine.AST.*
Exclude ProfilingATNSimulator: it is not needed for parsing and causes MSVC build errors with high_resolution_clock on Windows.
ProfilingATNSimulator is referenced by other ANTLR runtime code, so it can't be excluded. Use /FI"chrono" on MSVC to fix the missing high_resolution_clock symbol.
Use setup-python's built-in poetry cache for faster dependency installs. Combine dependency install and wheel install into one step.
This one is postponed until further notice, pending downstream analysis of the build-process changes and how they may affect installing the library in very constrained scenarios. Review is scheduled for April-May.
Summary
- Replace antlr4-python3-runtime with a C++ ANTLR parser exposed via pybind11, cutting parse time by 95.7% (9.6s → 0.41s for 8000 statements)
- Switch build backend from poetry-core to scikit-build-core for C++ extension compilation
- cibuildwheel for Python 3.9-3.13 on Linux, macOS, and Windows

Changes
C++ Parser (src/vtlengine/AST/Grammar/_cpp_parser/)
- bindings.cpp: pybind11 module with lazy-wrapping ParseNode/TerminalNode classes
- _rule_constants.py: 223 (rule_index, alt_index) constants for isinstance replacement
- Vtl.g4 (ANTLR 4.13.1)

ASTConstructor Refactoring
- isinstance(ctx, Parser.XxxContext) → ctx.ctx_id == RC.XXX
- isinstance(x, TerminalNodeImpl) → x.is_terminal
- x.getSymbol().text → x.text
- list(ctx.getChildren()) → ctx.children

Build System
- pyproject.toml: scikit-build-core backend, cibuildwheel config
- CMakeLists.txt: builds pybind11 extension with vendored ANTLR4 C++ runtime
- scripts/setup_antlr4_runtime.sh: downloads ANTLR4 C++ runtime for development
- ANTLR4CPP_STATIC define, /FI"chrono" for ProfilingATNSimulator

CI Workflows
- Built wheel cached with actions/cache (keyed on OS, Python version, and source hash). On cache hit, ANTLR download and C++ compilation are skipped entirely. Poetry dependency cache via setup-python
- Ubuntu 24.04: --no-deps to avoid numpy/pandas conflicts with system packages
- Version check: grep/tomllib, no build required
- cibuildwheel (no sdist)

CI Performance (Testing workflow)
Deleted Files
- src/vtlengine/AST/Grammar/lexer.py (2140 lines)
- src/vtlengine/AST/Grammar/parser.py (16415 lines)
- src/vtlengine/AST/VtlVisitor.py (906 lines)
- src/vtlengine/AST/Grammar/runtime_patches.py
- src/vtlengine/AST/Grammar/fast_lexer.py

New Files
- scripts/check_version.sh: standalone version consistency check
- scripts/setup_antlr4_runtime.sh: downloads ANTLR4 C++ runtime for development

MG01 Benchmark (rc6 vs rc7, 3 runs each)
Real-world benchmark using the MG01 VTL validation suite (7,387 AST statements, 7,379 output datasets). Outputs are identical between both versions.
(Benchmark table: rows create_ast and semantic_analysis; timings not preserved.)

Test plan